A novel method for Pu-erh tea face traceability identification based on improved MobileNetV3 and triplet loss

Ensuring the traceability of Pu-erh tea products is crucial in the production and sale of tea, as it is a key means to ensure their quality and safety. The common approach used in traceability systems is to bind Quick Response (QR) codes or Near Field Communication (NFC) chips to products in order to track every link in the supply chain. However, counterfeiting risks still persist, as QR codes or NFC chips can be copied and inexpensive products can be fitted into the original packaging. To address this issue, this paper proposes a tea face verification model called TeaFaceNet for traceability verification. The aim of this model is to improve the traceability of Pu-erh tea products by quickly identifying counterfeit products and enhancing the credibility of Pu-erh tea. The proposed method utilizes an improved MobileNetV3 combined with Triplet Loss to verify the similarity between two input tea face images with different texture features. The recognition accuracies of the TeaFaceNet network on the raw tea face dataset, ripe tea face dataset and mixed tea face dataset were 97.58%, 98.08% and 98.20%, respectively. Accurate verification of tea faces was achieved using the optimal threshold. In conclusion, the proposed TeaFaceNet model presents a promising approach to enhance the traceability of Pu-erh tea products and combat counterfeit products. The robustness and generalization ability of the model, as evidenced by the experimental results, highlight its potential for improving the accuracy of Pu-erh tea face recognition and enhancing the credibility of Pu-erh tea in the market. Further research in this area is warranted to advance the traceability of Pu-erh tea products and ensure their quality and safety.


Scientific Reports | (2023) 13:6986 | https://doi.org/10.1038/s41598-023-34190-z | www.nature.com/scientificreports/

... value. Many unscrupulous enterprises and individuals sell seconds at best-quality prices, which seriously affects the Pu-erh tea sales market, can mislead consumers and negatively affects the economic benefits to consumers 4. To improve traceability and combat counterfeiting, various technological solutions have been proposed. For instance, a traceability system that uses bound Quick Response (QR) codes or Near Field Communication (NFC) chips could trace every link of the supply chain 5. However, digital ID-based solutions cannot completely solve the problem of counterfeiting, as counterfeiters can still copy QR codes or NFC chips and fit cheaper products into the original packaging. One important way to enhance product traceability is to extract and use information about the unique and natural characteristics of the product 6. In the case of Pu-erh tea, the different and unique natural textures formed when tea is compressed into cakes can be used as the basis for tea face images.
Computer vision technology has made it possible to use deep learning and image processing methods for biometric identification, including face recognition 7,8. Many face recognition models and methods have been developed, such as DeepFace 9, SphereFace 10, center loss 11, state-of-the-art face recognition models 12, and LocalFace 13. Similar methods have also been used in animal feature recognition tasks, such as automatic identification of individual cows 14 and goats 15, pig face recognition 16, cow face recognition 17,18, and individual egg identification 19. We therefore hypothesized that biometric approaches could also be applied to the Pu-erh tea face recognition task.
The tea face recognition task can be divided into two types: tea face verification and tea face recognition. To improve the traceability of Pu-erh tea products, we proposed a Pu-erh tea face verification model, TeaFaceNet, based on an improved MobileNetV3. The model uses the ECA attention block in the lightweight network MobileNetV3 for feature extraction, so that texture features are expressed while the number of parameters is reduced. Triplet Loss and softmax loss are used as the loss functions. Our experimental results showed that the validation accuracy of the model was higher than that of some classical convolutional neural network (CNN) models. Constructing a verification model can improve the traceability of Pu-erh tea and help avoid adulteration.

Materials and methods
Data acquisition. The image data for this study were collected from a Pu-erh tea cake production plant in Puer city, Yunnan Province, China (22.78°N, 100.91°E). Two types of equipment were used to photograph each tea cake: a mobile phone (HONOR 50) and a high-speed photographic apparatus (Eloam High-Speed Portable HD DocScanner S820A3AF). The purpose was to simulate real-world scenarios, and a schematic diagram of the image acquisition process is shown in Fig. 1. The Eloam High-Speed Portable HD DocScanner S820A3AF has CMOS autofocusing technology with a 10-megapixel main camera that captures images at a resolution of 3264 × 2448. The HONOR 50 is a mobile phone released by HONOR on June 16, 2021, equipped with quad cameras of 108 + 8 + 2 + 2 megapixels. The resolution of the images acquired by the mobile phone is 3904 × 2928. A total of 200 pieces of Pu-erh raw tea and 200 pieces of Pu-erh ripe tea were collected, with 100 pieces of each type used for the training dataset and the other 100 pieces of each type for the test dataset. Each tea cake was photographed from the front and back. The image shooting standards are as follows: (1) use a white background and keep it clean and tidy without debris; (2) shoot at a distance of 20 cm directly above the tea cake; (3) ensure that the tea cake is in the center of the image; (4) make the tea cake fill as much of the frame as possible to ensure a clear texture.

Preprocessing. After the data acquisition was completed, the tea cake images were processed uniformly and resized to 320 × 320 × 3. The images were then expanded using data enhancement techniques. After the above operations, the following three training datasets were established: the Pu-erh raw tea face dataset, the Pu-erh ripe tea face dataset, and the mixed tea face dataset. The datasets include both the front and back images of the Pu-erh raw tea and Pu-erh ripe tea cakes.
Sample images from the Pu-erh tea face datasets are shown in Fig. 2. The amount of data in each training dataset is shown in Table 1. The training dataset of Pu-erh raw tea faces contains front and back images of 100 Pu-erh raw tea cakes, each captured with both types of equipment, resulting in a total of 400 images. After applying data augmentation techniques, the total number of images increased to 8000. Similarly, the training dataset of Pu-erh ripe tea faces contains front and back images of 100 Pu-erh ripe tea cakes, each captured with both devices, resulting in a total of 400 images. After data augmentation, the total number of images increased to 8000. The mixed tea face dataset contains all the raw and ripe Pu-erh tea faces from the previous datasets, resulting in a total of 800 images. After data augmentation, the total number of images increased to 16,000. During the training process, each dataset was split into a training set and a validation set in a 9:1 ratio. The training set and validation set for the Pu-erh raw tea face dataset and the Pu-erh ripe tea face dataset contained 7200 and 800 images respectively, while for the mixed tea face dataset they contained 14,400 and 1600 images respectively.

The test dataset was shot with the same shooting method from 100 pieces each of Pu-erh raw tea and Pu-erh ripe tea, and contains both front and back images, as shown in Table 2. From it, 1200 test pairs (600 pairs of the same tea face and 600 pairs of different tea faces) were selected for each of the Pu-erh raw tea face test dataset and the Pu-erh ripe tea face test dataset, and 2400 test pairs (1200 pairs of the same tea face and 1200 pairs of different tea faces) were selected for the mixed tea face dataset.
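The dataset sizes quoted above follow from simple arithmetic, which the short sketch below reproduces. The 20-fold augmentation factor is an inference from the reported 400 → 8000 images and is not stated explicitly in the text; the function and constant names are illustrative only.

```python
# Bookkeeping for the training datasets described in the text.
# AUGMENT_FACTOR is inferred (8000 / 400) and is an assumption, not a reported setting.
CAKES_PER_TYPE = 100  # training cakes per tea type (raw or ripe)
SIDES = 2             # front and back
DEVICES = 2           # mobile phone and document scanner
AUGMENT_FACTOR = 20   # inferred from the reported totals


def dataset_sizes(n_types):
    """Return (base, augmented, train, validation) image counts."""
    base = CAKES_PER_TYPE * SIDES * DEVICES * n_types
    augmented = base * AUGMENT_FACTOR
    validation = augmented // 10      # 9:1 split
    train = augmented - validation
    return base, augmented, train, validation


raw_sizes = dataset_sizes(n_types=1)    # raw-only (the ripe-only dataset is identical)
mixed_sizes = dataset_sizes(n_types=2)  # raw + ripe
print(raw_sizes, mixed_sizes)
```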
Data enhancement. When photographing a tea cake, it is difficult to fix its orientation because of its round shape. To improve the robustness of the deep neural network for tea face recognition in various scenes, we used rotation, flipping, random contrast and brightness adjustments, image noise, and random erasing to enhance the data. These data augmentation techniques enrich the dataset and improve the generalization of the model, allowing it to learn enough features to enhance its performance. The data enhancement techniques are illustrated in Fig. 3.
Image rotation. Firstly, image enhancement was performed using rotation. The original image was rotated by 45°, 90°, 135°, 180°, 225°, 270°, and 315°, and one mirror flip was also performed. This was done so that the model could learn the features at each angle and improve its rotation invariance.

Image noise. Salt-and-pepper noise and Gaussian noise were used to enhance the image data. Salt-and-pepper noise is an important type of noise that randomly changes pixels to black or white 20. Compared with other noises, images are more sensitive to salt-and-pepper noise. Gaussian noise, whose values follow a normal distribution, is superimposed on every point of the image. Using these two methods to enhance the images could improve the ability of the model to mine the deep features of the image and enhance the recognition performance of the model in complex scenes.
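As an illustration of the two noise types described above, the following NumPy sketch applies salt-and-pepper and Gaussian noise to an 8-bit image. The noise fraction `amount` and standard deviation `sigma` are placeholder values chosen for the example; the settings actually used in the experiments are not given in the text.

```python
import numpy as np


def add_salt_and_pepper(image, amount=0.02, rng=None):
    """Set a random fraction `amount` of pixels to black (0) or white (255)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = image.copy()
    h, w = image.shape[:2]
    mask = rng.random((h, w))             # one uniform draw per pixel
    noisy[mask < amount / 2] = 0          # "pepper"
    noisy[mask > 1 - amount / 2] = 255    # "salt"
    return noisy


def add_gaussian_noise(image, sigma=10.0, rng=None):
    """Add zero-mean, normally distributed noise to every pixel value."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, sigma, size=image.shape)
    noisy = image.astype(np.float64) + noise
    return np.clip(noisy, 0, 255).astype(np.uint8)


# Usage on a synthetic grey image standing in for a 320 x 320 x 3 tea face.
rng = np.random.default_rng(0)
image = np.full((320, 320, 3), 128, dtype=np.uint8)
sp_image = add_salt_and_pepper(image, amount=0.1, rng=rng)
gauss_image = add_gaussian_noise(image, sigma=10.0, rng=rng)
```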
Image brightness, chroma, contrast, sharpness. The brightness of the original image was adjusted by three randomly selected factors, each constrained to the range from $Value_{min} = 0.5$ to $Value_{max} = 2.0$. The same measures were taken to adjust image chroma, contrast, and sharpness. The enhanced images were then added to the training set. The main purpose of this enhancement method is to simulate the different light intensities under which a tea face may be photographed. The data processed by this method could also make up for the shortcomings of the neural network and make it more robust when tested under different light intensities.

Image random erasing. Zhong et al. 21 proposed a random erasing method for training CNNs that randomly selects rectangular regions in an image and replaces their pixels with random values. By using this method, images with different occlusion levels could be generated, which could reduce the risk of overfitting and make the model robust to occlusion.

Firstly, MobileNetV3 used depthwise separable convolution, which was designed to reduce the amount of computation and improve the computational speed of the network. Depthwise separable convolution mainly includes depthwise convolution and pointwise convolution. Depthwise convolution replaces the convolution kernel of the standard convolution with single-channel convolution kernels: when the input has N channels, there are N single-channel convolution kernels, and each channel is convolved separately before the results are stacked. Pointwise convolution then uses a 1 × 1 convolution to expand the channels. A comparison with standard convolution is shown in Fig. 4a and b.
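The saving from depthwise separable convolution can be checked with a quick parameter count. The sketch below compares a standard convolution with its depthwise separable counterpart; bias terms are ignored, and the layer sizes (3 × 3 kernel, 64 input and 128 output channels) are arbitrary example values rather than settings taken from the paper.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard convolution: one kernel x kernel x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = kernel * kernel * c_in  # one single-channel kernel per input channel
    pointwise = c_in * c_out            # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


standard = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
ratio = separable / standard  # equals 1 / c_out + 1 / kernel**2
print(standard, separable, round(ratio, 4))
```

For these example sizes the separable layer needs roughly 12% of the weights of the standard layer.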
Secondly, MobileNetV3 used linear bottlenecks, expansion layers and inverted residuals. The linear bottleneck was used to reduce the loss of feature information, and the inverted residuals were used to learn more features by expanding the channels. A residual block first reduces and then restores the channel dimension, whereas an inverted residual block first expands and then reduces it. Figure 4c shows the residual blocks, and Fig. 4d shows the inverted residuals and linear bottlenecks.

Lightweight network
Finally, MobileNetV3 placed the lightweight squeeze and excitation attention module after the depthwise filters in the expansion so that attention is applied to the largest representation. Figure 5 shows the structure of the MobileNetV3 block. MobileNetV3 also used a new activation function, h-swish, shown in Eq. (1).

$$\text{h-swish}(x) = x \cdot \frac{\mathrm{ReLU6}(x + 3)}{6} \quad (1)$$

Attention mechanism module. Attention mechanisms are essentially a set of weighting coefficients learned autonomously by the network and applied as dynamic weights to emphasize regions of interest while suppressing irrelevant regions. The first module considered was the squeeze and excitation (SE) block, the main representative of channel attention and the attention mechanism module used in MobileNetV3. The SE block is shown in Fig. 6a. It is mainly composed of two parts: squeeze and excitation. Secondly, the convolutional block attention module (CBAM) 26 was used in this experiment; it builds on the original channel attention and connects it with a spatial attention module (SAM). Figure 6b shows the structure of the CBAM module.
The structure of the Efficient Channel Attention (ECA) block 27 is shown in Fig. 6c. It uses a one-dimensional sparse convolution operation in place of the fully connected layer operations of the SE block, which significantly reduces the number of parameters while maintaining comparable performance. To compress the number of parameters and improve computational efficiency, the SE block adopts a "dimensionality reduction, then dimensionality increase" strategy, using two fully connected layers to learn the correlation between different channels; that is, each feature map interacts with all other feature maps, which is a dense connection. The ECA module simplifies this connection by making each channel interact only with its k neighboring channels. Aggregated features are obtained by global average pooling (GAP), and ECA generates channel weights by performing a fast 1D convolution of size k, where k is determined adaptively from the channel dimension C, as shown in Eq. (2).

$$k = \psi(C) = \left| \frac{\log_2 C}{\gamma} + \frac{b}{\gamma} \right|_{odd} \quad (2)$$

where $|t|_{odd}$ represents the odd number nearest to t, and γ and b are set to 2 and 1, respectively.
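To make the adaptive kernel size and the channel-weighting step concrete, here is a minimal NumPy sketch of an ECA-style forward pass with γ = 2 and b = 1. The rounding rule (truncate, then move an even value up to the next odd one) is an assumption borrowed from common ECA implementations, and the convolution weights are supplied by the caller in place of learned parameters.

```python
import math

import numpy as np


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1D kernel size k for a given channel count C."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 == 1 else t + 1  # force an odd kernel size (assumed rounding rule)


def eca_forward(x, kernel):
    """Apply ECA-style channel attention to a feature map x of shape (H, W, C)."""
    k = len(kernel)
    pooled = x.mean(axis=(0, 1))     # global average pooling, shape (C,)
    padded = np.pad(pooled, k // 2)  # zero padding keeps the output length at C
    # 1D convolution across neighbouring channels (cross-correlation form)
    conv = np.array([np.dot(padded[i:i + k], kernel) for i in range(pooled.size)])
    weights = 1.0 / (1.0 + np.exp(-conv))  # sigmoid gives one weight per channel
    return x * weights                     # broadcast over the spatial dimensions


# Usage: kernel sizes for a few channel counts, and a toy forward pass.
sizes = {c: eca_kernel_size(c) for c in (16, 64, 128, 256, 512)}
features = np.ones((4, 4, 8))
attended = eca_forward(features, kernel=np.zeros(3))  # zero kernel gives weights of 0.5
```

Whatever the channel count, the module needs only k weights, whereas the two fully connected layers of an SE block grow with the square of the channel count.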
Proposed model architecture. TF-Bottleneck block. In this paper, a TeaFaceNet bottleneck (TF-Bottleneck) block is proposed as an improvement of the MobileNetV3 block. Figure 7a shows the inverted residuals block, which mainly uses ReLU as the activation function. Figure 7b shows the TF-Bottleneck block. The ECA attention block is placed after the depthwise filter in the expansion so that attention is applied to the largest representation.
Backbone feature extraction network. TeaFaceNet feeds each batch of data into a redesigned deep convolutional neural network and then performs L2 normalization to produce embeddings of tea faces. Both triplet loss and softmax loss are used during training, and the trained network is then used for the tea face verification task. The training structure of the TeaFaceNet model is shown in Fig. 8.
The specifications of the backbone feature extraction network in this paper are shown in Table 3. The initial input size is adjusted to 320 × 320 × 3, and the final output is a 1 × 1 × 128 feature vector. In the backbone, {…, Layer13, Layer14, Layer15, Layer16} are linear TF-Bottleneck layers, in which the ECA module is added and h-swish is used as the activation function. {Layer17} is the Flatten layer, whose main purpose is to flatten the features as the transition from the convolutional layers to the fully connected layer. {Layer18} is a fully connected layer, whose main purpose is to map the input to a 128-dimensional feature vector.
Loss function. Triplet Loss 28 is chosen as the main loss function. The main objective is to minimize the Euclidean distance between an anchor and a positive image and maximize the Euclidean distance between the anchor and a negative image, as shown in Fig. 9. The triplet loss function to be minimized is shown in Eq. (4),

$$L = \sum_{i=1}^{N} \left[ \left\| f(x_i^a) - f(x_i^p) \right\|_2^2 - \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 + \alpha \right]_+ \quad (4)$$

where $f(x)$ is the embedding of image $x$; $x_i^a$, $x_i^p$ and $x_i^n$ are the anchor, positive and negative images of the $i$th triplet; α is the margin that enforces a distance gap between positive and negative pairs; and the sum runs over T, the set of all possible triplets in the training set, which has cardinality N.

Meanwhile, softmax loss 11 is added to the training. When only Triplet Loss is used, the model converges too slowly, because selecting data as triplets generates a very large number of combinations and these are chosen by random sampling, which reduces the training speed. The softmax loss function is shown in Eq. (5),

$$L_S = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^T x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^T x_i + b_j}} \quad (5)$$

where $x_i \in \mathbb{R}^d$ denotes the $i$th deep feature, belonging to the $y_i$th class, and $d$ is the feature dimension. $W_j \in \mathbb{R}^d$ denotes the $j$th column of the weights $W \in \mathbb{R}^{d \times n}$ in the last fully connected layer, and $b \in \mathbb{R}^n$ is the bias term. The size of the mini-batch and the number of classes are $m$ and $n$, respectively.
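A minimal NumPy sketch of the triplet loss on a batch of embeddings may help clarify the objective: each triplet contributes only when the negative is not yet farther from the anchor than the positive by at least the margin. The margin value 0.2 is a placeholder assumption, since the value used in training is not stated in this section.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Hinge-style triplet loss summed over a batch of embeddings of shape (N, d)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)  # squared anchor-positive distance
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)  # squared anchor-negative distance
    return float(np.sum(np.maximum(d_pos - d_neg + alpha, 0.0)))


# Toy unit-length embeddings: `a` and `p` coincide, `n` is orthogonal to both.
a = np.array([[1.0, 0.0]])
p = np.array([[1.0, 0.0]])
n = np.array([[0.0, 1.0]])

easy = triplet_loss(a, p, n)  # margin already satisfied, so the loss is zero
hard = triplet_loss(a, n, p)  # roles swapped, so the triplet is penalised
```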
Tea face verification process. Tea face verification mainly involves inputting the two images to be compared into the trained TeaFaceNet network, which extracts their deep features and produces two feature vectors that are mapped to a compact Euclidean space. The L2 distance between them directly represents the similarity gap between the two tea faces, and the verification result, i.e. whether or not they are the same tea face, is derived by comparing this distance with a threshold. The process is shown in Fig. 10.
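The decision step can be sketched in a few lines of NumPy. The embeddings below are hand-made stand-ins for the 128-dimensional vectors that the trained network would produce, and the threshold of 1.0 is a placeholder rather than one of the optimal thresholds determined in the experiments.

```python
import numpy as np


def embedding_distance(e1, e2):
    """L2 distance between two embeddings after L2 normalization."""
    e1 = e1 / np.linalg.norm(e1)
    e2 = e2 / np.linalg.norm(e2)
    return float(np.linalg.norm(e1 - e2))


def is_same_tea_face(e1, e2, threshold):
    """Declare a match when the distance falls below the threshold."""
    return embedding_distance(e1, e2) < threshold


# Hypothetical embeddings: two views of one cake and one view of a different cake.
view_1 = np.array([3.0, 4.0, 0.0])
view_2 = np.array([6.0, 8.0, 0.0])  # same direction, so distance 0 after normalization
other = np.array([0.0, 0.0, 5.0])   # orthogonal, so distance sqrt(2)

match = is_same_tea_face(view_1, view_2, threshold=1.0)
mismatch = is_same_tea_face(view_1, other, threshold=1.0)
```

Because the embeddings are normalized to unit length, the distance is bounded between 0 and 2, which keeps a single threshold meaningful across image pairs.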
Figure 9. Triplet Loss.

Results and discussion
Experimental environment and parameter settings. The experiments were conducted in Python.
The code was mainly based on the Keras deep learning framework. TensorFlow was used as the backend. The hardware and software configuration pieces of information are shown in Table 4. The hyperparameters for model training are shown in Table 5. www.nature.com/scientificreports/ Tea face recognition results. A test dataset was used to evaluate the TeaFaceNet model. Table 6 shows tea face verification results. The TeaFaceNet was compared with several other mainstream network models, including ResNet50 29 , VGG16 30 , Inception-ResNet-v1 31 , MobileNet and MobileNetV3. Among them, Mobile-NetV3 had the best recognition effect among the mainstream network models. The recognition accuracy of the raw tea face dataset, ripe tea face dataset and mixed tea face dataset of the TeaFaceNet network were 97.58%, 98.08% and 98.20%, respectively. TeaFaceNet network adds the ECA attention mechanism module to the use of depthwise separable convolution and linear bottlenecks, and the accuracy achieves better results in all three datasets, improving by 1.92%, 2.42% and 0.54% in the three datasets, respectively. The recognition accuracy was improved by replacing the attention mechanism module and redesigning the network structure. In terms of size in the model, TeaFaceNet was only second to MobileNet. The recognition accuracy was improved by 4%, 3% and 1% in the three datasets.
TeaFaceNet not only had the best accuracy on the raw tea dataset, ripe tea dataset and mixed dataset but also converged first during model training. Better results could be achieved when the model was trained for 100 epochs. The variation of loss values and validation set accuracy of the different network models on the raw tea dataset, ripe tea dataset and mixed dataset is shown in Fig. 11, Fig. 12 and Fig. 13, respectively.
All tests deal with two main types of cases, i.e., distinguishing between similar tea faces and dissimilar tea faces. Therefore, each model needs to be tested with an optimal threshold. The optimal threshold for each model was determined using ten-fold cross-validation. Table 7 shows the optimal thresholds for all models. The role of the threshold is to determine whether two tea faces are the same. A distance greater than the optimal threshold means that the two tea faces are dissimilar; a distance less than the optimal threshold means that they are similar. Figure 14 shows a verification case of the TeaFaceNet model.

Table 8 shows the Precision, Recall and F1-Score of the model on the test sets of the raw tea face dataset, ripe tea face dataset and mixed dataset. The experiments showed that TeaFaceNet could be implemented and achieved excellent results on the Pu-erh tea face verification task. The quality of the network model can be further assessed through the receiver operating characteristic (ROC) curve. The area under the ROC curve (AUC) is the size of the area under that curve. The AUC value lies between 0.5 and 1.0, with a larger AUC representing better performance. The closer the curve is to the upper left corner, the better the performance. Figure 15 shows the ROC curves of the model for each dataset.

Effect of attentional mechanism module on the model. To investigate the effect of the attention mechanism module on the model, experiments were conducted by replacing the ECA module in the model with the SE module and the CBAM module. Table 9 shows the results of tea face recognition under different attention mechanism modules. The experiments showed that better results were achieved using the ECA module, which also gave the smallest model size.
Compared with the model using the SE module, the accuracy increased by 0.83%, 0.33%, and 0.25% on the three datasets, and the model size was reduced by 5.8 MB. Compared with the model using the CBAM module, the accuracy increased by 1.25%, 4.92%, and 2% on the three datasets, and the model size was reduced by 72.1 MB. The features between channels had a large impact on the results of the tea face recognition task. These results indicate that the ECA module could effectively improve the accuracy of network verification.

Discussion
In this work, we propose a Pu-erh tea face verification approach called TeaFaceNet, based on an improved MobileNetV3, to enhance Pu-erh tea traceability identification. We construct three types of Pu-erh tea face datasets and establish a Pu-erh tea face verification network to achieve comprehensive verification of Pu-erh raw tea and Pu-erh ripe tea. The TeaFaceNet network achieved recognition accuracies of 97.58%, 98.08%, and 98.20% on the raw tea face dataset, ripe tea face dataset, and mixed tea face dataset, respectively. However, several issues remain in the area of tea face recognition. There is currently no publicly available dataset for Pu-erh tea faces, and the dataset used in this experiment needs further expansion. Our work solely addresses the Pu-erh tea face verification problem, and further exploration is required for the Pu-erh tea face recognition problem. In practical applications, breakage during transportation can also pose a challenge, and more discussion is needed on the verification and identification of Pu-erh tea faces after breakage.

Conclusion
The primary objective of this study was to address the challenge of tracing Pu-erh tea cakes and to facilitate the detection of counterfeit and substandard tea products. In this paper, we proposed a Pu-erh tea face verification model, TeaFaceNet, based on an improved MobileNetV3 architecture. The TeaFaceNet model extracts 128-dimensional features from each pair of Pu-erh tea face images and calculates the L2 distance between them to determine whether they are the same tea face, based on the similarity between images determined by the best threshold. The experimental results demonstrated that the TeaFaceNet model outperformed other models on the Pu-erh tea face dataset. The ECA block reduced the model size while extracting features, thereby improving the recognition rate of the network. The proposed model exhibited better robustness and generalization ability and achieved excellent results not only on individual class tea face verification tasks but also on mixed datasets. Our approach could serve as an empirical basis for subsequent Pu-erh tea face recognition tasks and aid in enhancing the traceability of Pu-erh tea products.